A simulation study of rater agreement measures 
with 2x2 contingency tables 

Manuel Ato*, Juan Jose Lopez and Ana Benavente 

Universidad de Murcia (Spain) 

A comparison between six rater agreement measures obtained using three 
different approaches was carried out by means of a simulation study. The 
coefficients suggested by Bennet's σ (1954), Scott's π (1955), Cohen's 
κ (1960) and Gwet's γ (2008) were selected to represent the classical, 
descriptive approach, the α agreement parameter from Aickin (1990) to 
represent the loglinear and mixture model approaches, and the Δ measure from 
Martin and Femia (2004) to represent the multiple-choice test approach. The 
main results confirm that the π and κ descriptive measures present high 
levels of mean bias in the presence of extreme values of prevalence and rater 
bias, but small to null levels with moderate values. The best behavior was 
observed with the Bennet and the Martin and Femia agreement measures for all 
levels of prevalence. 


Many behavioral research applications need to quantify the homogeneity 
of agreement between responses given by two (or more) observers or 
between two (or more) measurement devices. With responses on a numerical 
scale, the classical intraclass correlation coefficient 
(Shrout & Fleiss, 1972; McGraw & Gow, 1996) and the more recent 
concordance coefficient (Lin et al, 2002) are the most frequently used 
alternatives, and it has been demonstrated that Lin's coefficient can be 
estimated by means of a special intraclass coefficient (Carrasco & Jover, 
2003). With measures on a categorical scale there is a greater variety of 
options (Shroukri, 2004; von Eye and Mun, 2005). In both cases the 
methodological scenario is matched pairs, where two or more raters classify 
N targets on a categorical scale with M categories, producing an M x M 
contingency table also known as an agreement table (Agresti, 2002). 
Three general approaches to rater agreement measurement for 
categorical data lead to three different forms of agreement coefficients 
(Ato, Benavente and Lopez, 2006). A first, descriptive approach started from 
a suggestion of Scott (1955) to remove from the observed agreement the 
proportion of cases in which agreement occurred only at random. Several 
descriptive measures have been defined within this approach. In this paper 
the interest will be focused on the π-coefficient (Scott, 1955), on a 
proposal first suggested by Bennet et al. (1954), reused afterwards under 
different names and relabeled as the σ-coefficient (see Zwick, 1988; Hsu & 
Field, 2003), on the classical κ-coefficient (Cohen, 1960) and on the more 
recent proposal of Gwet's γ-coefficient (2008). Differences between these 
measures lie in the particular definition of chance correction. 

Loglinear and mixture modelling is a second approach, which is used 
when the focus is the detailed examination of agreement and disagreement 
patterns (Tanner & Young, 1986; Agresti, 1992). Mixture modelling is a 
generalization of the loglinear approach with an unobserved (categorical) 
latent variable. The set of targets to be classified is assumed to be drawn 
from a population that is a mixture of two subpopulations (latent classes), 
one related to objects that are easy to classify by both raters (systematic 
agreement) and the other to objects that are hard to classify (random 
agreement and disagreement). Within this approach it is also possible to 
reproduce all the descriptive rater measures (Guggenmoos-Holtzman & Vonk, 
1998; Schuster & von Eye, 2001; Schuster & Smith, 2002; Ato, Benavente and 
Lopez, 2006) and also to define new rater agreement measures such as the 
α-coefficient (Aickin, 1990). 

A third alternative approach is inspired by the tradition of 
multiple-choice tests and was developed in order to overcome the limitations 
shown by many descriptive measures. Within this tradition, Martin & Femia 
(2004, 2008) defined a new measure of agreement, the Δ-coefficient, as 'the 
proportion of agreements that are not due to chance'. 

In this paper we use a simulation study to compare the behavior of 
these rater agreement measures for 2 (raters) x 2 (categories) agreement 
tables. The rest of this paper is organized as follows. The second section 
presents the main notation and formulas used for descriptive, loglinear 
and mixture rater agreement measures. The third section describes the 
simulation process of this study in detail. The final section shows the main 
results obtained and some implications for research practice. 
Rater agreement measures 


Notation 

Let A and B denote 2 raters classifying a set of targets into M categories, 
with responses $i = 1, \ldots, M$ for observer A and $j = 1, \ldots, M$ for 
observer B. In this work we confine our interest to the 2 (raters) x 2 
(categories) agreement table case. Let $\{\pi_{ij}\}$ be the joint 
probabilities of responses in row $i$ and column $j$ given by both raters, 
and $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ the row and column marginal 
distributions resulting from summing the joint probabilities, where 

$$\sum_i \pi_{i+} = \sum_j \pi_{+j} = \sum_i \sum_j \pi_{ij} = 1.$$

Given a sample of N objects to be classified, Table 1 summarizes the 
notation used in this paper, characterizing the four cells inside the 
agreement table: $p_{11} = n_{11}/N$ refers to the proportion of agreement 
responses of both raters for the first (or positive) category, 
$p_{22} = n_{22}/N$ to the proportion of agreement responses for the second 
(or negative) category, and $p_{12} = n_{12}/N$ and $p_{21} = n_{21}/N$ are 
proportions of disagreement responses between raters. Similarly, 
$p_{1+} = n_{1+}/N$ and $p_{2+} = n_{2+}/N$ are marginal proportions for 
both categories corresponding to rater A, and $p_{+1} = n_{+1}/N$ and 
$p_{+2} = n_{+2}/N$ are marginal proportions for rater B. 


Table 1. Joint and marginal proportions for the 2x2 agreement table. 

                        Rater B 
Rater A         1             2             Marginal A 
1               $p_{11}$      $p_{12}$      $p_{1+}$ 
2               $p_{21}$      $p_{22}$      $p_{2+}$ 
Marginal B      $p_{+1}$      $p_{+2}$      $p_{++}$ 
Descriptive measures 

A simple formula to measure agreement is the observed proportion of 
agreement, which is the sum of the diagonal cells of Table 1. For the 2x2 
agreement table, $p_o = p_{11} + p_{22}$. This formula is a common component 
in all measures of agreement considered in this paper, but it does not tell 
us how much of the observed agreement is due to chance. Thus the concept of 
"random corrected agreement" (RCA) is the basic notion that pervades the 
practical use of descriptive measures of agreement. 

A general measure of RCA is 

$$RCA = \frac{p_o - p_e}{1 - p_e} = \frac{p_o}{1 - p_e} - \frac{p_e}{1 - p_e} \tag{1}$$

where $p_e$ is the chance agreement proportion and $p_o - p_e$ corrects the 
excess of agreement that is assumed to be included in the observed agreement 
proportion $p_o$. The difference must be weighted with $1 - p_e$, the 
maximum possible proportion of non-chance agreement. The values of RCA are 
constrained to lie inside the interval 
$[-p_e/(1 - p_e);\ +p_o/(1 - p_e)]$, where $RCA = -p_e/(1 - p_e) \geq -1$ is 
the lower limit associated with perfect disagreement, $RCA = 0$ means that 
observed agreement is equal to the chance agreement probability, and 
$RCA = p_o/(1 - p_e) \leq 1$ is the upper limit associated with perfect 
agreement, which is only possible when the chance agreement proportion is 
zero. 
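
As a minimal sketch of formula (1), the following Python function computes a chance-corrected coefficient from any pair of observed and chance agreement proportions. The function name `rca` and the example values are illustrative assumptions, not part of the original study.

```python
def rca(p_o: float, p_e: float) -> float:
    """Random corrected agreement, formula (1): (p_o - p_e) / (1 - p_e)."""
    if not 0.0 <= p_e < 1.0:
        raise ValueError("chance agreement p_e must lie in [0, 1)")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical values: 90% observed agreement, 50% expected by chance.
print(round(rca(0.90, 0.50), 3))  # 0.8
```

Each descriptive coefficient discussed next differs only in the value passed as `p_e`.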

The four descriptive measures used in this study are based on the 
general formula (1) and assume independence between raters in the 
classification process. Differences between them are due to the specific 
definition of chance-corrected agreement $p_e$. 

A first solution was originally proposed by Bennet et al. (1954), using 
as chance correction formula a fixed value, the inverse of the number 
of categories. This solution has been relabeled as the σ-coefficient, and 
its chance-corrected proportion as $p_e^{\sigma} = 1/M$. More recently, the 
σ-coefficient has become a consolidated agreement measure and has been 
reconsidered in works such as Holley & Guilford (1964), Janson & Vegelius 
(1979), Maxwell (1977) and Brennan & Prediger (1981). See Zwick, 1988 and 
Hsu & Field, 2003, for a more detailed explanation of how researchers have 
been using different names for the same procedure. This form of chance 
correction assumes that observers classify targets uniformly into 
categories, and so is based on a uniform distribution of targets. For M = 2 
categories, $p_e^{\sigma} = .5$, and using (1) the σ-coefficient can be 
estimated as 

$$\sigma = \frac{p_o - 0.5}{1 - 0.5} = 2(p_{11} + p_{22}) - 1. \tag{2}$$

A second solution was proposed by Scott (1955), using the squared 
mean of row and column marginal proportions for each category as 
chance-corrected agreement. Defining the mean of marginal probabilities for 
each category simply as $p_1 = (p_{1+} + p_{+1})/2$ and 
$p_2 = (p_{2+} + p_{+2})/2$, then 

$$p_e^{\pi} = p_1^2 + p_2^2. \tag{3}$$

This formula assumes that observers classify targets using a common 
homogeneous distribution. The resulting π-coefficient can be estimated 
using (1) as 

$$\pi = \frac{p_o - p_e^{\pi}}{1 - p_e^{\pi}}. \tag{4}$$

A third solution was proposed by Cohen (1960), using as chance-corrected 
formula the sum of products of row and column marginal probabilities for 
each category 

$$p_e^{\kappa} = p_{1+} p_{+1} + p_{2+} p_{+2} \tag{5}$$

and so assumes that each observer classifies targets using his/her own 
distribution. The resulting κ-coefficient is also estimated using (1) as 

$$\kappa = \frac{p_o - p_e^{\kappa}}{1 - p_e^{\kappa}}. \tag{6}$$

Research literature (see Feinstein & Cichetti, 1990; Byrt, Bishop & 
Carlin, 1993; Agresti, Ghosh & Bini, 1995; Lantz & Nebenzahl, 1996 and 
Hoehler, 2000, among others) reports that the κ- and π-coefficients have two 
main limitations as a direct consequence of using these chance-corrected 
agreement measures. Prevalence problems refer to the particular distribution 
of data across the categories and arise in the presence of extreme values 
for one of the two categories. Rater bias problems appear particularly when 
the marginal distributions of the two raters are extreme. 
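
The prevalence problem can be illustrated with a short sketch. The two frequency tables below are hypothetical (they are not taken from the study): both have the same observed agreement of .90, but one has balanced categories and the other an extreme prevalence of the first category. The function name is likewise an assumption made for illustration.

```python
def kappa_and_pi(n11, n12, n21, n22):
    """Return (Cohen's kappa, Scott's pi) for a 2x2 table of frequencies."""
    n = n11 + n12 + n21 + n22
    p_o = (n11 + n22) / n
    a1, a2 = (n11 + n12) / n, (n21 + n22) / n   # rater A marginals
    b1, b2 = (n11 + n21) / n, (n12 + n22) / n   # rater B marginals
    pe_kappa = a1 * b1 + a2 * b2                # chance agreement, formula (5)
    m1, m2 = (a1 + b1) / 2, (a2 + b2) / 2       # mean marginals per category
    pe_pi = m1 ** 2 + m2 ** 2                   # chance agreement, formula (3)
    return (p_o - pe_kappa) / (1 - pe_kappa), (p_o - pe_pi) / (1 - pe_pi)


balanced = kappa_and_pi(45, 5, 5, 45)   # prevalence of category 1 about .50
skewed = kappa_and_pi(85, 5, 5, 5)      # prevalence of category 1 about .90
print([round(v, 3) for v in balanced])  # [0.8, 0.8]
print([round(v, 3) for v in skewed])    # [0.444, 0.444]
```

With identical raw agreement, both coefficients drop from .80 to about .44 when the first category dominates, which is the kind of behavior the cited literature describes.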

More recently, Gwet (2001, 2008) proposed an RCA formula that 
seems to be more stable under certain conditions and uses the mean of 
marginal probabilities for each category, $p_1$ and $p_2$, to define 
chance-corrected agreement simply as 

$$p_e^{\gamma} = 2 p_1 p_2 \tag{7}$$

and the resulting γ-coefficient is estimated using (1) as 

$$\gamma = \frac{p_o - p_e^{\gamma}}{1 - p_e^{\gamma}}. \tag{8}$$

Descriptive measures, and the κ-coefficient in particular, are very 
popular among researchers in the behavioral and social sciences. The 
interpretation of these measures has generally been framed as a classical 
null hypothesis of RCA equal to zero (see Fleiss, Levin and Paik, 2003). It 
has been shown that they can also be understood as restricted loglinear or 
mixture models within the QI family (see Ato, Benavente and Lopez, 2006). 


Loglinear and mixture model measures 

Given a set of N targets to be classified by 2 raters in 2 response 
categories, loglinear models distinguish between components such as 
random expected agreement and non-random expected agreement, and so 
they can evaluate the fit of the model to the data (von Eye & Mun, 2005). 
The basic starting model representing random expected agreement is the 
independence model, $\log(m_{ij}) = \lambda + \lambda_i^A + \lambda_j^B$, 
where $m_{ij}$ are expected values, $\lambda_i^A$ and $\lambda_j^B$ are 
individual rater effects, and the model releases $M^2 - 2M + 1$ residual 
degrees of freedom, with M being the number of categories. Because rater 
agreement is concentrated on the diagonal cells, a more appropriate model is 
the quasi-independence (QI) model, 

$$\log(m_{ij}) = \lambda + \lambda_i^A + \lambda_j^B + \delta_i \tag{9}$$

where $\delta_i$ is a diagonal parameter that represents systematic 
agreement for the $i$-th category. This model releases $M^2 - 3M + 1$ 
residual degrees of freedom and so cannot be directly estimated with 2x2 
agreement tables. 

The QI model is directly related to the concept of RCA. A general 
formula to obtain a model-based agreement measure, which allows 
removing from the observed agreement a component due to chance, can be 
defined (see Guggenmoos-Holtzmann, 1993; Guggenmoos-Holtzmann and 
Vonk, 1998; Ato, Benavente and Lopez, 2006) by: 

$$RCA(QI) = \sum_i \left[ p_{ii} - \frac{p_{ii}}{\exp(\delta_i)} \right] \tag{10}$$

where $p_{ii}$ is the observed agreement proportion in the $i$-th diagonal 
cell and $\exp(\delta_i)$ is the estimate of the exponential transformation 
of the $\delta_i$ diagonal parameter in the QI model (9). As can be seen, 
the framework of equation (10) is very similar to RCA equation (1), and the 
range of possible values spreads from $\sum_i p_{ii} \leq 1$, when there is 
no disagreement, to $-\sum_i p_{ii}/\exp(\delta_i) \geq -1$, when agreement 
is null. 

The QI model is the most general of a family of models whose members 
can be defined by applying some restrictions to the basic model. A basic 
kind of restriction is the constant quasi-independence (QIC) model (Ato, 
Benavente and Lopez, 2006), 

$$\log(m_{ij}) = \lambda + \lambda_i^A + \lambda_j^B + \delta \tag{11}$$

where $\delta$ represents systematic agreement and is assumed constant for 
all categories. This model, which was first proposed by Tanner & Young 
(1996), releases $M^2 - 2M$ residual degrees of freedom, being saturated for 
2x2 tables, and can be defined using a formulation similar to (10) by 

$$RCA(QIC) = \sum_i \left[ p_{ii} - \frac{p_{ii}}{\exp(\delta)} \right] \tag{12}$$

A generalization of loglinear models including a latent variable leads to 
latent-class mixture models, where targets to be classified are assumed to 
come from a population representing a mixture of two finite subpopulations. 
Each subpopulation identifies a cluster of homogeneous items, one related 
to easy-to-classify items where both raters give the same response 
(systematic agreement), and the other related to items that are difficult to 
classify, where both raters give a random response (random agreement 
and disagreement). This generalization changes the status of the M x M 
agreement table into a 2 x M x M table, but it hardly affects the loglinear 
model required to estimate its parameters. Assuming a QI loglinear model, 
the representation of a QI mixture model is very similar to (9), 

$$\log(m_{ijk}) = \lambda + \lambda_i^A + \lambda_j^B + \xi_i \tag{13}$$

where the mixture $\xi_i$ parameters are related to the loglinear 
$\delta_i$ parameters (Guggenmoos-Holtzman, 1993; Ato, Benavente & Lopez, 
2006) by 

$$\exp(\xi_i) = \exp(\delta_i) - 1 \tag{14}$$

and the new subscript $k$ indicates the extra dimension of the agreement 
table due to the latent variable. 

A popular agreement measure derived from a mixture model is α 
(Aickin, 1990), and its equivalent loglinear model is QIC (11). The 
representation of the QIC mixture model is 

$$\log(m_{ijk}) = \lambda + \lambda_i^A + \lambda_j^B + \xi \tag{15}$$

where again the correspondence (14) allows a QIC mixture model to be 
estimated by means of its equivalent loglinear model using (12). For the 
case of a 2 x 2 agreement table, Guggenmoos-Holtzman (1993: 2194) also 
showed that the constant parameter $\exp(\delta)$ could also be estimated 
using the odds ratio 

$$\exp(\delta) = \sqrt{\frac{p_{11} p_{22}}{p_{21} p_{12}}} \tag{16}$$

and the agreement measure can be obtained using (12). 
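
The odds-ratio estimate in (16) is simple to compute directly. The sketch below is illustrative only: the function name is an assumption, and frequencies are used instead of proportions because N cancels in the ratio.

```python
import math


def exp_delta_odds_ratio(n11, n12, n21, n22):
    """Square root of the 2x2 odds ratio, as in formula (16)."""
    if n12 == 0 or n21 == 0:
        # The ratio is undefined when a disagreement cell is empty.
        raise ValueError("formula (16) needs non-zero disagreement cells")
    return math.sqrt((n11 * n22) / (n12 * n21))


# Frequencies 81-2-8-9 for cells 11-12-21-22 of a table with N = 100.
print(exp_delta_odds_ratio(81, 2, 8, 9))  # 6.75
```

The guard against empty disagreement cells is one reason why tables containing zero cells are awkward for this family of measures.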

Multiple-choice test measures 

Martin & Femia (2004) proposed a new rater agreement measure, 
called Delta (Δ), which was developed in the context of multiple-choice 
tests, where a student has to choose one of M possible responses for each 
of N targets known to the evaluator. If a student knows a fraction (say 
Δ = .4) of the responses and fills out the whole test, then it is assumed 
that 40% of the responses will be accurately recognized and the other 60% 
will be classified at random. The response model postulated for this 
situation (see Martin and Luna, 1989) is that the student will give a 
correct reply if the response is known and will pick a response at random if 
the response is unknown. In this case Δ is really a measure of the 
conformity of the student with the evaluator. 
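
The response model can be sketched numerically. Assuming that unknown responses are picked uniformly at random among the M options (an assumption made here for illustration), the expected proportion of correct replies is Δ + (1 − Δ)/M, and the relation can be inverted to recover the known fraction. The helper names are hypothetical.

```python
def expected_correct(delta: float, m: int) -> float:
    """Expected proportion of correct replies: known items plus lucky guesses."""
    return delta + (1.0 - delta) / m


def delta_from_correct(p_correct: float, m: int) -> float:
    """Invert the model to recover the known fraction from the proportion correct."""
    return (p_correct - 1.0 / m) / (1.0 - 1.0 / m)


p = expected_correct(0.4, 2)                # 40% known, 60% guessed between 2 options
print(round(p, 3))                          # 0.7
print(round(delta_from_correct(p, 2), 3))   # 0.4
```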

The Δ-coefficient, as it will be used in this paper, generalizes the 
multiple-choice test situation, where there is a single object, to the 
agreement situation, where there are several objects and the intensity of 
recognition need not be the same for each object (Martin and Femia, 2004). 
Although inspired by a different tradition, in practice the Δ-coefficient 
seems to reproduce exactly the same results as those obtained from the 
RCA(QI) model (10) for the general M x M case. But for the 2x2 
agreement table, as was pointed out before, Δ and RCA(QI) cannot be 
estimated. Nevertheless, Martin & Femia (2004: 9) suggested a simple 
asymptotic approximation of the Δ-coefficient, formulated as 

$$\Delta = (p_{11} + p_{22}) - 2\sqrt{p_{12}\, p_{21}} \tag{17}$$

which can be used as a consistent rater agreement measure in this context. A 
more recent asymptotic re-formulation of the Δ-coefficient (Martin & Femia, 
2008: 766), which is especially useful for contingency tables with zero 
cells, is a simple modification of equation (17) that can be obtained in 
practice by adding the constant 1 to all the observed frequency cells of a 
contingency table, but it will not be used in this paper. 

To see all selected measures in action, consider an agreement table for a 
sample of N = 100 targets where the observed frequencies for the 11-12-21-22 
cells are 81-2-8-9. The values of the descriptive agreement measures are 
σ = .800 (eq. 2), π = .585 (eqs. 3-4), κ = .588 (eqs. 5-6) and γ = .868 
(eqs. 7-8); the loglinear/mixture agreement measure is α = .744 (eqs. 15-16) 
and the multiple-choice test agreement measure is Δ = .820 (eq. 17). These 
large differences between rater agreement measures with the same purpose 
justify a simulation study. 
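
The descriptive values and Δ reported for this table can be checked with a short sketch. It assumes that Gwet's chance agreement is computed as $2 p_1 p_2$ and that Δ is computed as $(p_{11} + p_{22}) - 2\sqrt{p_{12} p_{21}}$, which are the forms that reproduce the reported figures. Aickin's α is left out of the sketch, and the function and key names are illustrative.

```python
import math


def agreement_measures(n11, n12, n21, n22):
    """Descriptive coefficients and Delta for a 2x2 table of frequencies."""
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = n11 / n, n12 / n, n21 / n, n22 / n
    p_o = p11 + p22                          # observed agreement
    a1, a2 = p11 + p12, p21 + p22            # rater A marginals
    b1, b2 = p11 + p21, p12 + p22            # rater B marginals
    p1, p2 = (a1 + b1) / 2, (a2 + b2) / 2    # mean marginals per category

    def rca(p_e):
        # General chance-corrected form, formula (1).
        return (p_o - p_e) / (1 - p_e)

    return {
        "sigma": rca(0.5),                   # Bennet: 1/M with M = 2
        "pi": rca(p1 ** 2 + p2 ** 2),        # Scott
        "kappa": rca(a1 * b1 + a2 * b2),     # Cohen
        "gamma": rca(2 * p1 * p2),           # Gwet (assumed form)
        "delta": p_o - 2 * math.sqrt(p12 * p21),  # Martin & Femia (assumed form)
    }


for name, value in agreement_measures(81, 2, 8, 9).items():
    print(f"{name}: {value:.3f}")
```

Run on the 81-2-8-9 table, the sketch prints .800, .585, .588, .868 and .820, matching the values given in the text.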


A SIMULATION STUDY 

A simulation study with these descriptive, loglinear/mixture and 
multiple-choice test rater measures for 2x2 agreement tables was 
performed varying parameters of prevalence of first or positive category 
(PCP, from 0.10 to 0.90 in steps of 0.10), discrimination probability (DP, 
from 0.50 to 0.90 in steps of 0.10) and number of targets (N of 30, 100 and 
300). DP is a key variable that is used to capture the random responses of 
raters. It refers to the ability of raters to discriminate between easy and 
difficult to classify targets and so allows distinguishing between reliable 
(with high DP) and unreliable (with low DP) raters. So a total of 3 (number 
of targets, N) x 9 (levels of PCP) x 5 (levels of DP) combinations were 
generated for this simulation study. 
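
The factorial design described above can be enumerated in a few lines. This is only an illustrative sketch; the variable names are assumptions.

```python
from itertools import product

N_VALUES = (30, 100, 300)
PCP_VALUES = tuple(round(0.10 * k, 2) for k in range(1, 10))    # 0.10 ... 0.90
DP_VALUES = tuple(round(0.50 + 0.10 * k, 2) for k in range(5))  # 0.50 ... 0.90

# Every combination of number of targets, prevalence and discrimination.
design = list(product(N_VALUES, PCP_VALUES, DP_VALUES))
print(len(design))  # 135
```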
The simulation model borrowed its essential aspects from 
the latent mixture and multiple-choice test models, but it also has some 
peculiarities added in order to refine the random behavior of raters in the 
choice of categories. 

The process starts by fixing the number N of targets, a prevalence 
value PCP for the first or positive category and a discrimination 
probability DP. A total of 1000 empty 2x2 agreement tables were generated 
for each combination of N, PCP and DP. In all cases, we assumed that the 
discrimination probability was the same for both observers, a reasonable 
assumption if raters receive the same training, as is usual in many rater 
agreement studies. Given both values of prevalence and discrimination, the 
average proportion $p_{11}$ would be PCP*DP + (1-DP)/4, which can be 
partitioned into a portion due to systematic agreement, PCP*DP, and 
another due to random agreement, (1-DP)/4, whereas the average proportion 
$p_{22}$ would be (1-PCP)*DP + (1-DP)/4, also partitioned into a portion of 
systematic agreement, (1-PCP)*DP, and another of random agreement, 
(1-DP)/4. 
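
A minimal sketch of these expected proportions follows; the function name and the example values PCP = .10 and DP = .90 are illustrative choices.

```python
def expected_diagonal(pcp: float, dp: float):
    """Expected p11 and p22, each the sum of a systematic and a random part."""
    random_part = (1.0 - dp) / 4.0           # random share falling in one diagonal cell
    p11 = pcp * dp + random_part             # systematic part: PCP*DP
    p22 = (1.0 - pcp) * dp + random_part     # systematic part: (1-PCP)*DP
    return p11, p22


p11, p22 = expected_diagonal(pcp=0.10, dp=0.90)
print(round(p11, 3), round(p22, 3), round(p11 + p22, 3))  # 0.115 0.835 0.95
```

Summing the two expressions gives DP + (1-DP)/2 for any prevalence, so under this model the expected raw agreement depends only on DP.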

For each case of an empty agreement table, the element was assigned 
to the positive or negative category (depending on the fixed PCP), and a 
random number R between 0 and 1 was generated and evaluated. If R ≤ DP, 
the case fell inside the discrimination range, was easy to classify and so 
was considered as correctly classified. If R > DP, the case was considered 
difficult to classify and so was randomly assigned to any one of the four 
cells. When zero values were detected in one or more cells of an agreement 
table, the table was deleted and the process continued with the next one. 
This occurred in a few cases, in particular with extreme values of DP and 
N = 30 targets (see the total number of tables used for all combinations of 
PCP and DP in Table 2). A flow diagram of the simulation process is shown in 
Figure 1. 
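
The steps above can be sketched as follows. This is a hedged reading of the procedure, not the authors' code: it assumes that the true category of each target is drawn at random with probability PCP and that difficult targets fall in each of the four cells with equal probability. The function name, the seed and the chosen parameter values are illustrative.

```python
import random


def simulate_table(n, pcp, dp, rng):
    """Return (2x2 counts, SA, RA) for one table, or None if any cell is zero."""
    counts = [[0, 0], [0, 0]]
    systematic = random_diag = 0
    for _ in range(n):
        # Assumption: the true category is drawn with probability PCP.
        true_cat = 0 if rng.random() < pcp else 1
        if rng.random() <= dp:
            # Easy target: both raters classify it correctly.
            counts[true_cat][true_cat] += 1
            systematic += 1
        else:
            # Difficult target: assigned at random to one of the four cells.
            i, j = rng.randrange(2), rng.randrange(2)
            counts[i][j] += 1
            random_diag += (i == j)
    if any(c == 0 for row in counts for c in row):
        return None  # tables with zero cells are discarded
    return counts, systematic / n, random_diag / n


rng = random.Random(2024)
tables = [t for t in (simulate_table(100, 0.30, 0.70, rng) for _ in range(1000)) if t]
mean_sa = sum(sa for _, sa, _ in tables) / len(tables)
print(len(tables), round(mean_sa, 2))
```

The second and third returned values correspond to the systematic and random components of the diagonal proportion, which add up to the raw observed agreement of the table.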

Since the raw observed agreement (ROA) is the sum of proportions 
for responses to diagonal cells of each agreement table, the simulation 
process distinguished between two components of ROA: systematic 
agreement (SA), as the non-random proportion of easy to classify targets 
and so correctly classified in the diagonal cells, and random agreement 
(RA), as the proportion of difficult to classify targets that were randomly 
classified in the diagonal cells. 
Figure 1. Flow diagram of the simulation process. 


RESULTS 

For each agreement table, the response of interest was the 
absolute bias, defined as the absolute difference between the systematic 
agreement proportion (SA) and the estimate of each of the six rater 
agreement measures. Absolute bias was first evaluated for all levels of 
discrimination (DP), prevalence of positive category (PCP) and number of 
targets (N), but due to the high similarity of results for different numbers 
of targets, the data from the three options were merged. Table 2 shows the 
absolute mean bias (with standard errors in brackets) of the six rater 
measures for a selected 20 of the 45 combinations of PCP and DP, and the 
number of 2 x 2 contingency tables used for each combination with all merged 
data after discarding tables containing zero values in any of the four 
cells. 

Three different sets of rater agreement measures emerged. All of them 
were easily detected at all levels of discrimination probability. Ordered 
from greater to smaller mean bias, the first set comprised Cohen's κ and 
Scott's π coefficients, the second included Gwet's γ and Aickin's α, 
and the third Bennet's σ and Martin's Δ coefficients. 

The first and second sets scored similar absolute mean bias results with 
moderate prevalence levels, ranging between .30 and .70, but diverged from 
these stable levels with more extreme prevalence values. We also noted that 
all rater measures presented a moderate deviation from a stable performance 
inside the range .30-.70, but the deviation was more marked with the κ and π 
set than with the α and γ set, whereas the σ and Δ set was nearly unbiased. 

The behavior of the first set was the most biased of all rater agreement 
measures evaluated, particularly with low and high prevalence levels (see 
Figures 2 to 5). This result confirms a solid finding of the research 
literature concerning the high sensitivity of the κ (and π) coefficients to 
trait prevalence in the population, and reinforces the recommendation to 
avoid using either coefficient as a reliable rater measure with extreme 
prevalence levels (Byrt, Bishop and Carlin, 1993; Feinstein and Cichetti, 
1990; Lantz and Nebenzahl, 1996). 

The behavior of the second set was less biased than that of the first set, 
particularly with higher levels of prevalence, but the bias was the same 
with intermediate levels of prevalence for all DP levels. We were 
disappointed in particular with the behavior of Aickin's α-coefficient but, 
as was pointed out before, the assumption of a constant systematic agreement 
parameter across categories is restrictive for 2 x 2 agreement tables. 

The behavior of the third set was excellent, with mean bias close to zero 
for all levels of prevalence of the positive category and for all levels of 
discrimination probability used. Both the σ and Δ coefficients may be 
considered as unbiased rater agreement coefficients with 2x2 agreement 
tables. Although differences between the measures of the third set were 
negligible, the best behavior was obtained in all cases with the asymptotic 
approximation of Martin and Femia's Δ. The σ coefficient had an excellent 
behavior with 2x2 agreement tables, but because it relies on a uniform 
distribution of targets we suspect that this result cannot be extrapolated 
to agreement tables of higher dimensionality, where the uniformity feature 
could be severely penalized. 





Comparing the behavior of rater agreement measures between levels 
of discrimination probability, a noticeable result was that mean bias was 
higher in the range DP = .70-.80 than for DP < .69 or DP > .81, particularly 
with extreme levels of prevalence and with the first and second sets of 
rater measures, a strange result that deserves further investigation. 


Table 2. Absolute mean bias (standard error in brackets) and number 
of tables used for selected combinations of discrimination (DP) and 
prevalence (PCP) for all agreement measures. 
Figure 2. Behaviour of rater measures with DP = .60 

Figure 3. Behaviour of rater measures with DP = .70 

Figure 4. Behaviour of rater measures with DP = .80 

Figure 5. Behaviour of rater measures with DP = .90 
SUMMARY 

A simulation study of rater agreement measures for 2x2 contingency 
tables. A comparison between six measures obtained using three different 
approaches to the evaluation of agreement is addressed by means of a 
simulation study. The agreement coefficients chosen were Bennet's σ (1954), 
Scott's π (1955), Cohen's κ (1960) and Gwet's γ (2001; 2008) to represent 
the classical descriptive approach, Aickin's α coefficient (1990) to 
represent the loglinear and mixture model approach, and the Δ measure of 
Martin and Femia (2004) to represent the multiple-choice test approach. The 
results obtained confirm that the π and κ coefficients show notable 
differences from the remaining coefficients, particularly in the presence of 
extreme values of prevalence and rater bias. The best behavior was observed 
with Bennet's σ and Martin and Femia's Δ coefficients for all values of 
prevalence and rater bias. 